Module 1: Portfolio Analysis with R
University of South Florida
We are using the OCT (Open Collaboration Tools) extension for demonstrations.
You’ll be able to browse my working directory and files freely.
Some essential keybindings may not be available in VSCode.
You can copy-paste them into your settings.json file.
A package for manipulating data in R
Designed especially for the data.frame object
Greatly simplifies existing functionality in base R
Makes workflow intuitive, readable
Also fast, as key operations are written in C++
The tidyverse includes the dplyr package, whose verbs are typically chained with the pipe |>:

select() : select a subset of columns
rename() : rename columns
relocate() : change column positions
filter() : select a subset of rows matching a condition
arrange() : reorder rows
mutate() : add new columns (variables)
summarize() : generate a summary table, often based on group_by()

select() works on columns (variables).
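As a quick sketch (using the built-in iris data), select() can produce the three-column preview shown below:

```r
library(dplyr)

# keep only three columns of iris, then preview the first rows
sel <- iris |>
  select(Sepal.Length, Petal.Width, Species) |>
  head(3)

sel
```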
Sepal.Length Petal.Width Species
1 5.1 0.2 setosa
2 4.9 0.2 setosa
3 4.7 0.2 setosa
Select from “Sepal.Length” (1st) to “Petal.Length” (3rd) column
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
Select columns except for specified columns
Convenience functions are allowed within select().
Examples:
starts_with(), ends_with(), contains()
…and lots more!
matches(), num_range(), all_of(), any_of(), everything(), last_col(), where()
Check tidyselect documentation
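For instance, the two previews below are consistent with starts_with() and ends_with() (a sketch on the iris data):

```r
library(dplyr)

# columns whose names start with "Sepal"
sepal_cols <- iris |> select(starts_with("Sepal")) |> head(3)

# columns whose names end with "Length"
length_cols <- iris |> select(ends_with("Length")) |> head(3)
```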
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
relocate() rearranges the column order.
Columns that are not mentioned keep their original ordering.
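A minimal relocate() sketch on iris:

```r
library(dplyr)

# move Species to the front; the other columns keep their relative order
moved <- iris |> relocate(Species) |> head(3)
names(moved)
```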
filter() reduces the rows based on a condition.
Example 1: Subset rows with conditions
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Example 2: Multiple conditions
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
2 0.26 Good D VS1 58.4 63 403 4.19 4.24 2.46
3 0.26 Good E VVS1 57.9 60 554 4.22 4.25 2.45
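The exact conditions behind the two outputs above are not shown; a sketch in the same spirit on the diamonds data:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Example 1 style: a single condition
cheap <- diamonds |> filter(price < 400)

# Example 2 style: multiple conditions, combined with AND
good_d <- diamonds |> filter(cut == "Good", color == "D")
```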
slice() selects rows by position.

Exercise: from the diamonds dataframe (load tidyverse for access), subset the dataframe with the conditions below:
carat is equal to 0.26, and
clarity is “VS2”;
then select columns “carat”, “cut”, “clarity”
and store the result as sub_diamonds.
What is the dimension of sub_diamonds? Check with dim(sub_diamonds)
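One possible solution sketch for this exercise:

```r
library(dplyr)
library(ggplot2)  # diamonds

sub_diamonds <- diamonds |>
  filter(carat == 0.26, clarity == "VS2") |>
  select(carat, cut, clarity)

dim(sub_diamonds)  # number of matching rows x 3 columns
```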
arrange() sorts the dataframe based on the specified column's values.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
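The preview above is consistent with sorting iris by Sepal.Length, e.g.:

```r
library(dplyr)

# sort ascending by Sepal.Length and preview the shortest sepals
shortest <- iris |> arrange(Sepal.Length) |> head(3)
```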
mutate() computes transformations of variables and creates new columns.
\[ \frac{X - \bar{X}}{\sigma_X} = \frac{X - \operatorname{mean}(X)}{\operatorname{sd}(X)} \]
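The standardization above translates directly into a mutate() call (a sketch on iris; the new column name is hypothetical):

```r
library(dplyr)

# add a standardized (z-score) version of Sepal.Length
standardized <- iris |>
  mutate(Sepal.Length.Z = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length))
```

The new column has mean 0 and standard deviation 1 by construction.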
Often we are interested in summary statistics for groups.
Summarizing is also called data aggregation.
summarize() is often used with group_by() to generate grouped summaries.
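The grouped preview below (first 3 rows per Species) can be generated with a sketch like:

```r
library(dplyr)

# keep the first 3 rows within each Species group
first3 <- iris |>
  group_by(Species) |>
  slice_head(n = 3)
```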
# A tibble: 9 × 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3 5.9 2.1 virginica
The code above shows the first 3 rows for each Species.
What if we want to summarize across every column?
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.33
3 virginica 6.59 2.97 5.55 2.03
everything() is a selection helper function, like the ones we have already seen: starts_with(), ends_with(), contains().
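The per-Species means above can be reproduced with across() and everything() (a sketch):

```r
library(dplyr)

# take the mean of every remaining column, within each Species
species_means <- iris |>
  group_by(Species) |>
  summarize(across(everything(), mean))
```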
From diamonds dataframe:
Use filter() and summarize(), with helpers such as min() and max(), n(), and cor(x, y) for the correlation between x and y.
Create a column xyz, which is the multiplication of x, y, and z, and store the dataframe as my_diamond.
What is max(my_diamond$xyz)?

Real-world data does not come clean, ready for your analysis.
You will learn a consistent data structure, i.e. Tidy data.
The same data can be presented in various ways. The example data contains four variables: country, year, cases, and population.
The data (table1, table2, table3) is available when you load tidyverse.
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
Which data structure is easier to work with, for general purpose?
table1 is tidy data.
A definition of tidy data: each variable is a column; each observation is a row; each value is a cell.
When data is tidy, it is generally easier to work with the data.
If you want to generate a rate of TB per 10,000 population on table1:
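On tidy data this is a one-line mutate(); the output below is consistent with:

```r
library(tidyr)  # provides table1
library(dplyr)

# TB cases per 10,000 people
tb_rate <- table1 |>
  mutate(rate = cases / population * 10000)
```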
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
This is not simple to do with the table2 layout.
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Load package tidyverse.
table1 will be ready for access.
Use group_by() and summarize():
# A tibble: 3 × 2
country avg_pop
<chr> <dbl>
1 Afghanistan 20291216.
2 Brazil 173255630
3 China 1276671928.
# A tibble: 3 × 2
country total_cases
<chr> <dbl>
1 Afghanistan 3411
2 Brazil 118225
3 China 426024
For your analysis, you will need to pivot the data (aka reshape):
to longer form (fewer variables and more observations)
or to wider form (more columns and fewer rows)
Tip
Whenever you pivot the data, think about columns that are affected, and the names and values.
Let’s take a look at the billboard dataset, which is in wide form.
Each observation is a song, and we have 76 columns that describe rank.
To tidy up, we want a single “rank” column that stores the numbers.
How can we pivot the data so that we have rank in one column?
Simple illustration on how pivot_longer() works:
| artist | track | date.entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk1 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk2 | 82 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk3 | 72 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk4 | 77 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk5 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk6 | 94 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk7 | 99 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk8 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk9 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk10 | NA |
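The long-form table above can be produced with pivot_longer() (a sketch):

```r
library(tidyr)
library(dplyr)

billboard_longer <- billboard |>
  pivot_longer(
    cols      = starts_with("wk"),  # the 76 weekly rank columns
    names_to  = "week",             # old column names go into "week"
    values_to = "rank"              # old cell values go into "rank"
  )
```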
Exercise: from the billboard data, select artist, track, and the columns “wk1” to “wk12”. Can you use num_range() inside select()?

pivot_wider() works in the opposite way, in that it:
increases the number of columns (variables)
decreases the number of rows (observations)
Wider forms are common for machine learning.
Previous billboard_longer data:
# A tibble: 5 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
Pivot the data to wider form:
# A tibble: 5 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
Note that pivot_wider() needs at least two inputs: names_from and values_from.
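A sketch of the widening step, starting from a freshly lengthened billboard:

```r
library(tidyr)
library(dplyr)

billboard_longer <- billboard |>
  pivot_longer(starts_with("wk"), names_to = "week", values_to = "rank")

billboard_wide <- billboard_longer |>
  pivot_wider(
    names_from  = week,  # column that supplies the new column names
    values_from = rank   # column that fills the new cells
  )
```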
Stock price data commonly comes in the form below:
| symbol | date | open | high | low | close | volume | adjusted |
|---|---|---|---|---|---|---|---|
| TSLA | 2023-01-03 | 118.47 | 118.80 | 104.64 | 108.10 | 231402800 | 108.10 |
| TSLA | 2023-01-04 | 109.11 | 114.59 | 107.52 | 113.64 | 180389000 | 113.64 |
| TSLA | 2023-01-05 | 110.51 | 111.75 | 107.16 | 110.34 | 157986300 | 110.34 |
| TSLA | 2023-01-06 | 103.00 | 114.39 | 101.81 | 113.06 | 220911100 | 113.06 |
| TSLA | 2023-01-09 | 118.96 | 123.52 | 117.11 | 119.77 | 190284000 | 119.77 |
| TSLA | 2023-01-10 | 121.07 | 122.76 | 114.92 | 118.85 | 167642500 | 118.85 |
Question: What if you wanted to have a column for each stock’s adjusted price, like below?
| date | TSLA | BAC | XOM |
|---|---|---|---|
| 2023-01-03 | 108.10 | 30.97427 | 96.07030 |
| 2023-01-04 | 113.64 | 31.55659 | 96.34991 |
| 2023-01-05 | 110.34 | 31.49190 | 98.50566 |
| 2023-01-06 | 113.06 | 31.80616 | 99.69628 |
| 2023-01-09 | 119.77 | 31.32551 | 97.83819 |
| 2023-01-10 | 118.85 | 31.53811 | 99.29941 |
table1 and table2 are available when you load tidyverse. How can you transform table1 to table2?
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
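One possible solution sketch:

```r
library(tidyr)
library(dplyr)

# stack cases and population into a type/count pair, then match table2's order
table1_long <- table1 |>
  pivot_longer(
    cols      = c(cases, population),
    names_to  = "type",
    values_to = "count"
  ) |>
  arrange(country, year)
```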
Examine the population data, available from tidyverse.
Pivot the data to generate the output below.
(Hint: browse the documentation with ?pivot_wider and see the names_prefix argument.)
| country | year_1995 | year_1996 | year_1997 | year_1998 | year_1999 | year_2000 | year_2001 | year_2002 | year_2003 | year_2004 | year_2005 | year_2006 | year_2007 | year_2008 | year_2009 | year_2010 | year_2011 | year_2012 | year_2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 17586073 | 18415307 | 19021226 | 19496836 | 19987071 | 20595360 | 21347782 | 22202806 | 23116142 | 24018682 | 24860855 | 25631282 | 26349243 | 27032197 | 27708187 | 28397812 | 29105480 | 29824536 | 30551674 |
| Albania | 3357858 | 3341043 | 3331317 | 3325456 | 3317941 | 3304948 | 3286084 | 3263596 | 3239385 | 3216197 | 3196130 | 3179573 | 3166222 | 3156608 | 3151185 | 3150143 | 3153883 | 3162083 | 3173271 |
| Algeria | 29315463 | 29845208 | 30345466 | 30820435 | 31276295 | 31719449 | 32150198 | 32572977 | 33003442 | 33461345 | 33960903 | 34507214 | 35097043 | 35725377 | 36383302 | 37062820 | 37762962 | 38481705 | 39208194 |
| American Samoa | 52874 | 53926 | 54942 | 55899 | 56768 | 57522 | 58176 | 58729 | 59117 | 59262 | 59117 | 58652 | 57919 | 57053 | 56245 | 55636 | 55274 | 55128 | 55165 |
| Andorra | 63854 | 64274 | 64090 | 63799 | 64084 | 65399 | 68000 | 71639 | 75643 | 79060 | 81223 | 81877 | 81292 | 79969 | 78659 | 77907 | 77865 | 78360 | 79218 |
| Angola | 12104952 | 12451945 | 12791388 | 13137542 | 13510616 | 13924930 | 14385283 | 14886574 | 15421075 | 15976715 | 16544376 | 17122409 | 17712824 | 18314441 | 18926650 | 19549124 | 20180490 | 20820525 | 21471618 |
| Anguilla | 9807 | 10063 | 10305 | 10545 | 10797 | 11071 | 11371 | 11693 | 12023 | 12342 | 12637 | 12903 | 13145 | 13365 | 13571 | 13768 | 13956 | 14132 | 14300 |
| Antigua and Barbuda | 68349 | 70245 | 72232 | 74206 | 76041 | 77648 | 78972 | 80030 | 80904 | 81718 | 82565 | 83467 | 84397 | 85349 | 86300 | 87233 | 88152 | 89069 | 89985 |
| Argentina | 34833168 | 35264070 | 35690778 | 36109342 | 36514558 | 36903067 | 37273361 | 37627545 | 37970411 | 38308779 | 38647854 | 38988923 | 39331357 | 39676083 | 40023641 | 40374224 | 40728738 | 41086927 | 41446246 |
| Armenia | 3223173 | 3173425 | 3137652 | 3112958 | 3093820 | 3076098 | 3059960 | 3047002 | 3036032 | 3025652 | 3014917 | 3002911 | 2989882 | 2977488 | 2968154 | 2963496 | 2964120 | 2969081 | 2976566 |
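One possible solution sketch using names_prefix:

```r
library(tidyr)

population_wide <- population |>
  pivot_wider(
    names_from   = year,
    values_from  = population,
    names_prefix = "year_"  # prepend "year_" to each new column name
  )
```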
Financial data encompasses information related to financial markets, instruments, and economic indicators.
The risk-return tradeoff is a fundamental concept in finance.
Return Calculation for Individual Assets:
The holding period return of an asset (\(r\)) over a period is calculated using the formula:
\[ r = \frac{P_{end} - P_{begin} + D}{P_{begin}} \]
\(r\) is the rate of return for the holding period
\(P_{end}\) is the ending price of the asset
\(P_{begin}\) is the beginning price of the asset
\(D\) represents any dividends or income received during the period
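A quick numeric check of the holding period return formula (the prices here are hypothetical):

```r
# hypothetical: bought at 100, sold at 105, received 2 in dividends
p_begin <- 100
p_end   <- 105
d       <- 2

r <- (p_end - p_begin + d) / p_begin
r  # 0.07, i.e. a 7% holding period return
```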
Tip
The return on a portfolio is a weighted sum of the individual returns of the assets within the portfolio.
The return of a portfolio (\(r_p\)) is calculated as:
\[ r_p = \sum_{i=1}^{n} w_i r_i \]
\(r_p\) is the return of the portfolio over the period
\(w_i\) is the weight of asset \(i\) in the portfolio
\(r_i\) is the return of asset \(i\) over the period
\(n\) is the number of assets in the portfolio
Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.
If 60% of your portfolio is invested in asset A and 40% in asset B,
Calculate the expected annual return of the portfolio.
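A worked sketch of this exercise, directly applying the weighted-sum formula:

```r
# weights and expected returns for assets A and B
w <- c(A = 0.60, B = 0.40)
r <- c(A = 0.08, B = 0.12)

r_p <- sum(w * r)
r_p  # 0.096, i.e. an expected return of 9.6% per year
```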
Stand-alone risk considers the risk of a single asset independently
The formula for risk (\(\sigma\)) is:
\[ \sigma = \sqrt{\frac{\sum (r_t - \bar{r})^2}{T - 1}} = sd(r) \]
Where \(r_t\) is the return in period \(t\), \(\bar{r}\) is the average return over the sample, and \(T\) is the number of periods.
Portfolio risk involves the risk associated with holding a portfolio of assets.
Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.
The annual standard deviation of returns for asset A is 10% (0.1), and for asset B, it’s 15% (0.15). The correlation coefficient between the returns of assets A and B is 0.5.
If 60% of your portfolio is invested in asset A and 40% in asset B,
Calculate the expected annual risk of the portfolio.
\[ \sigma_p^2 = w_A^2 \sigma_A^2 + w_B^2 \sigma_B^2 + 2 w_A w_B \sigma_A \sigma_B \rho_{AB} \]
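Plugging the exercise numbers into the two-asset variance formula:

```r
w_A <- 0.60; w_B <- 0.40   # weights
s_A <- 0.10; s_B <- 0.15   # standard deviations
rho <- 0.50                # correlation between A and B

var_p   <- w_A^2 * s_A^2 + w_B^2 * s_B^2 + 2 * w_A * w_B * s_A * s_B * rho
sigma_p <- sqrt(var_p)
sigma_p  # about 0.1039, i.e. portfolio risk of roughly 10.4% per year
```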
At each time \(t\), calculate the return of a portfolio (\(r_p\))
\[ r_{p,t} = \sum_{i=1}^{n} w_{i,t} r_{i,t} \]
Then calculate standard deviation of portfolio returns (\(\sigma_p\)) over \(t\)
\[ \sigma_p = sd(r_p) = \sqrt{\frac{\sum (r_{p,t} - \bar{r_p})^2}{T - 1}} \]
# A tibble: 6 × 3
symbol date adjusted
<chr> <date> <dbl>
1 TSLA 2020-01-02 28.7
2 TSLA 2020-01-03 29.5
3 TSLA 2020-01-06 30.1
4 TSLA 2020-01-07 31.3
5 TSLA 2020-01-08 32.8
6 TSLA 2020-01-09 32.1
# A tibble: 6 × 3
# Groups: symbol [1]
symbol date daily_return
<chr> <date> <dbl>
1 BAC 2020-01-03 -0.0208
2 BAC 2020-01-06 -0.00143
3 BAC 2020-01-07 -0.00660
4 BAC 2020-01-08 0.0101
5 BAC 2020-01-09 0.00172
6 BAC 2020-01-10 -0.00828
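The daily returns above are presumably computed from adjusted prices with lag(); a self-contained sketch on toy data (the tibble below stands in for the real price data):

```r
library(dplyr)

# toy adjusted-price data in the same shape as the output above
prices <- tibble(
  symbol   = "TSLA",
  date     = as.Date("2020-01-02") + 0:2,
  adjusted = c(28.7, 29.5, 30.1)
)

# simple return: today's price over yesterday's price, minus one
returns <- prices |>
  group_by(symbol) |>
  mutate(daily_return = adjusted / lag(adjusted) - 1)
```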
We will use the pivot technique to move away from the tidy form.
# A tibble: 6 × 4
date BAC TSLA XOM
<date> <dbl> <dbl> <dbl>
1 2020-01-03 -0.0208 0.0296 -0.00804
2 2020-01-06 -0.00143 0.0193 0.00768
3 2020-01-07 -0.00660 0.0388 -0.00818
4 2020-01-08 0.0101 0.0492 -0.0151
5 2020-01-09 0.00172 -0.0219 0.00766
6 2020-01-10 -0.00828 -0.00663 -0.00889
Drop rows with missing values
# A tibble: 6 × 5
date BAC TSLA XOM port_ret
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-01-03 -0.0208 0.0296 -0.00804 0.000718
2 2020-01-06 -0.00143 0.0193 0.00768 0.00933
3 2020-01-07 -0.00660 0.0388 -0.00818 0.00623
4 2020-01-08 0.0101 0.0492 -0.0151 0.00924
5 2020-01-09 0.00172 -0.0219 0.00766 -0.00241
6 2020-01-10 -0.00828 -0.00663 -0.00889 -0.00809
Derived from portfolio theory and optimization.
\[ E(r_i)=r_f+β_i(E(r_m)−r_f)\]
Where:
\(E(r_i)\) is the expected return of the investment
\(r_f\) is the risk-free rate
\(\beta_i\) is the beta of the investment
\(E(r_m)\) is the expected return of the market
\((E(r_m) - r_f)\) is known as the market risk premium
Beta is a measure of the sensitivity of an individual investment’s returns to the market.
\[ \beta_i = \frac{Cov(r_i - r_f, r_m-r_f)}{Var(r_m-r_f)} \]
\[ r_i - r_f = \beta_0 + \beta_1(r_m - r_f)+ e_i \]
Tip
In some specific cases (e.g., intraday frequency) beta is estimated with raw return (\(r_i\) and \(r_m\)), not excess returns (\(r_i - r_f\))
Prepare stock returns and a market return (index).
Then combine the two dataframes (bind them row-wise).
tickers <- c("MSFT")
stock_prices <- tq_get(tickers, from = '2020-01-01', to = '2023-12-31')
snp500 <- tq_get(
"SP500",
get = "economic.data",
from = "2020-01-01",
to = "2023-12-31"
)
snp500 <- snp500 |> rename(adjusted = price) # rename column for binding
prices <- bind_rows(stock_prices, snp500)
prices |> glimpse()
Rows: 2,049
Columns: 8
$ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
$ date <date> 2020-01-02, 2020-01-03, 2020-01-06, 2020-01-07, 2020-01-08, …
$ open <dbl> 158.78, 158.32, 157.08, 159.32, 158.93, 161.84, 162.82, 161.7…
$ high <dbl> 160.73, 159.95, 159.10, 159.67, 160.80, 162.22, 163.22, 163.3…
$ low <dbl> 158.33, 158.06, 156.51, 157.32, 157.95, 161.03, 161.18, 161.2…
$ close <dbl> 160.62, 158.62, 159.03, 157.58, 160.09, 162.09, 161.34, 163.2…
$ volume <dbl> 22622100, 21116200, 20813700, 21634100, 27746500, 21385000, 2…
$ adjusted <dbl> 152.5057, 150.6067, 150.9960, 149.6193, 152.0025, 153.9014, 1…
Pivot price data into wide form:
# A tibble: 6 × 3
date MSFT SP500
<date> <dbl> <dbl>
1 2020-01-02 153. 3258.
2 2020-01-03 151. 3235.
3 2020-01-06 151. 3246.
4 2020-01-07 150. 3237.
5 2020-01-08 152. 3253.
6 2020-01-09 154. 3275.
Generate stock and index returns
Prepare the risk-free rate (use the 3-month Treasury bill rate).
The risk-free rate is in percentage terms (%) and annualized.
Convert the rate to the daily level.
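The output below is consistent with dividing the percentage quote by 100 and then by 252 trading days (the day-count convention is an assumption here):

```r
# convert an annualized percent rate to a daily decimal rate
annual_pct <- 1.54                    # e.g. a 3-month T-bill quote in %
rf_daily   <- annual_pct / 100 / 252  # assumes 252 trading days per year
rf_daily   # about 0.0000611
```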
# A tibble: 6 × 4
symbol date price rf
<chr> <date> <dbl> <dbl>
1 DGS3MO 2020-01-01 NA NA
2 DGS3MO 2020-01-02 1.54 0.0000611
3 DGS3MO 2020-01-03 1.52 0.0000603
4 DGS3MO 2020-01-06 1.56 0.0000619
5 DGS3MO 2020-01-07 1.54 0.0000611
6 DGS3MO 2020-01-08 1.54 0.0000611
Convert to wide form:
# A tibble: 6 × 2
date DGS3MO
<date> <dbl>
1 2020-01-01 NA
2 2020-01-02 0.0000611
3 2020-01-03 0.0000603
4 2020-01-06 0.0000619
5 2020-01-07 0.0000611
6 2020-01-08 0.0000611
Join the two dataframes column-wise
# A tibble: 3 × 4
date MSFT SP500 DGS3MO
<date> <dbl> <dbl> <dbl>
1 2020-01-02 NA NA 0.0000611
2 2020-01-03 -0.0125 -0.00706 0.0000603
3 2020-01-06 0.00258 0.00353 0.0000619
Then generate excess returns:
Use lm() to fit the linear regression.
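A self-contained sketch of both steps on simulated data (the column names mirror the ones used above, but the data itself is made up):

```r
set.seed(1)

# simulated daily returns standing in for the real MSFT / S&P 500 series
n        <- 250
mkt_ret  <- rnorm(n, mean = 0.0005, sd = 0.01)
msft_ret <- 0.0005 + 1.2 * mkt_ret + rnorm(n, sd = 0.01)  # true beta = 1.2
rf       <- rep(0.00006, n)                               # daily risk-free rate

capm_data <- data.frame(
  MSFT_exret = msft_ret - rf,  # excess stock return
  Mkt_exret  = mkt_ret  - rf   # excess market return
)

fit <- lm(MSFT_exret ~ Mkt_exret, data = capm_data)
coef(fit)  # the Mkt_exret slope is the beta estimate
```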
Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)
Coefficients:
(Intercept) Mkt_exret
0.0005149 1.1718519
To browse the summary of the regression:
Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)
Residuals:
Min 1Q Median 3Q Max
-0.068986 -0.006335 -0.000456 0.006333 0.076456
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005149 0.0003655 1.409 0.159
Mkt_exret 1.1718519 0.0251585 46.579 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01154 on 996 degrees of freedom
(45 observations deleted due to missingness)
Multiple R-squared: 0.6854, Adjusted R-squared: 0.6851
F-statistic: 2170 on 1 and 996 DF, p-value: < 2.2e-16
How to access and extract coefficient estimates?
(Intercept) Mkt_exret
0.0005148839 1.1718518803
[1] "double"
To extract beta estimate:
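coef() returns a named numeric vector, so a single estimate can be pulled out by name; a sketch using the built-in cars data just to have a fitted model to index into:

```r
fit <- lm(dist ~ speed, data = cars)

coef(fit)                         # named numeric vector of estimates
beta_hat <- coef(fit)[["speed"]]  # extract one coefficient by name
beta_hat
```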
The CAPM estimate (or any other) depends heavily on the timeframe of your choice.
Consider the following:
Should they be similar? Usually not.
Using our previous example, estimate the CAPM beta with the following:
Given a set “lookback” period, the estimate should only use past information.
To capture timely information at each period, rolling regressions are often performed.
Generate stock prices with the code below.
Using adjusted daily closing prices, calculate the expected (average) return and risk of
an equal-weighted portfolio
30% on Apple and 70% on Microsoft
70% on Apple and 30% on Microsoft
Based on the portfolio return (70% on Apple and 30% on Microsoft), calculate the portfolio beta.
Importing data is a crucial initial step in any data science project.
We will learn how to read external data into R as a data.frame object,
and how to write an R data.frame to a local file.
CSV: Comma-Separated Values
A plain, human-readable text data file
Minimalistic, widely used
Typically opened with Excel, but it is not an Excel file!
Since it is text, R tries to “guess” the type of each column when importing
A csv representation of iris dataframe:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
When saved to csv file:
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
Write and read csv
Many packages support writing/reading csv files:
base R (utils package): basic but slow
readr from tidyverse: fast, functional
vroom, data.table: extremely fast and functional

We use the readr package and will discuss I/O speed later.
To write a data.frame to a csv file: write_csv()
To read a .csv file to a data.frame: read_csv()
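A round-trip sketch through a temporary file:

```r
library(readr)

# write iris out, then read it back in
path <- tempfile(fileext = ".csv")
write_csv(iris, path)

iris_back <- read_csv(path, show_col_types = FALSE)
head(iris_back)
```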
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can specify a downloadable URL instead:
# A tibble: 6 × 9
mpg cylinders displacement horsepower weight acceleration model_year origin
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 18 8 307 130 3504 12 70 usa
2 15 8 350 165 3693 11.5 70 usa
3 18 8 318 150 3436 11 70 usa
4 16 8 304 150 3433 12 70 usa
5 17 8 302 140 3449 10.5 70 usa
6 15 8 429 198 4341 10 70 usa
# ℹ 1 more variable: name <chr>
To read a .xlsx file to a data.frame: read_excel() from readxl package
# A tibble: 6 × 11
`in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Net Income/Starting Line 19658 31398 42206 47681 33448 33615
2 Depreciation & Amortiza… 11917 14760 15583 15888 17182 17297
3 Non-Cash Items -1474 129 262 3313 482 3110
4 Change in Working Capit… -1663 2126 -2706 -10712 -6198 -8906
5 Cash from Operating Act… 28438 48413 55345 56170 44914 45116
6 Change in Fixed Assets … -22491 -26871 -30975 -34271 -33669 -32952
# ℹ 4 more variables: `FY '15` <dbl>, `FY '16` <dbl>, `FY '17` <dbl>,
# `FY '18` <dbl>
There are other common data formats:
janitor package offers simple variable name cleaner: clean_names().
# A tibble: 6 × 11
in_million_usd fy_09 fy_10 fy_11 fy_12 fy_13 fy_14 fy_15 fy_16 fy_17
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Net Income/Sta… 19658 31398 42206 47681 33448 33615 16551 8375 19848
2 Depreciation &… 11917 14760 15583 15888 17182 17297 18048 22308 19893
3 Non-Cash Items -1474 129 262 3313 482 3110 -750 -5313 -7921
4 Change in Work… -1663 2126 -2706 -10712 -6198 -8906 -3505 -3288 -1754
5 Cash from Oper… 28438 48413 55345 56170 44914 45116 30344 22082 30066
6 Change in Fixe… -22491 -26871 -30975 -34271 -33669 -32952 -26490 -16163 -15402
# ℹ 1 more variable: fy_18 <dbl>
Load janitor and readxl.
Read the Excel file with read_excel() and store it as exxon_statement.
Read the exxon_categories.csv file with read_csv() as exxon_categories.
Join exxon_categories to exxon_statement.
Rename in_million_usd to Account.
Pivot to exxon_long, selecting the year columns with starts_with().

Categorical variables (factors) take a predefined set of values.
To visualize distribution of categorical variable, bar plots are often used.
geom_bar() and geom_col():
geom_bar() : when you need counts of a single category
geom_col() : when you need a different Y value
The end goal of the bar plot:
To fill the color, use “fill”. For border color, use “color”.
To change scientific notation to currency format: use scales package.
To add more breaks (ticks) on Y axis, control n.breaks in continuous scale.
To add axis labels and plot title, use labs() function.
total_amt_bycat |>
ggplot(aes(x = Category, y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobile Total Revenue / Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Revenue / Expenditure ($)",
caption = "Source: Exxon Financial Statements"
)

By default, ggplot uses a gray background. To change it, you can use other built-in themes.
total_amt_bycat |>
ggplot(aes(x = Category, y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobile Total Revenue / Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Revenue / Expenditure ($)",
caption = "Source: Exxon Financial Statements"
) +
theme_bw()

To reorder a factor variable according to its value, use fct_reorder() on the variable.
total_amt_bycat |>
ggplot(aes(x = fct_reorder(Category, -Total_USD) , y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobile Total Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Expenditure ($)",
caption = "Source: Exxon Financial Statements"
) +
theme_bw()

Numeric variables have an infinite range of values.
A commonly used visualization is the histogram.
An alternative for numeric distributions is the density plot.
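A minimal sketch of both plots on the diamonds data:

```r
library(ggplot2)

# histogram of a numeric variable
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50)

# density plot as an alternative view of the same distribution
p_dens <- ggplot(diamonds, aes(x = price)) +
  geom_density()
```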
Visualizing two (or more) variables can be useful to show variable relationships.
Barplots can be used to visualize two categorical variables.
Scatterplots and regressions can effectively visualize relationships.
Time series plots have a date variable on the X axis.
Line plots are often used.
Candlestick price charts are often used as well:
FIN4773: Big Data and Machine Learning in Finance